tv Today in Washington CSPAN May 1, 2013 2:00am-6:01am EDT
about culture than anything else, and I am writing way more about us than about them. I am interested in their reaction: my gosh, this is interesting, we're suffering more, you Americans did not know this all along? I guess it is something I will find out. >> We have to stop there.
>> Good evening, and welcome to the Commonwealth Club of California, the place where you are in the know. I am the host of Tech Nation, heard on NPR and also on XM and Sirius radio, and I am your moderator this evening. Tonight's program is held in association with the Commonwealth Club's Science and Technology Forum, exploring visions of the future through science and technology. You can find us on the internet at commonwealthclub.org, or download the Android app for program information and podcasts. It is my pleasure to introduce today's guests, the professor of internet governance and regulation at Oxford University and the data editor for The Economist, who have
written the new book Big Data: A Revolution That Will Transform How We Live, Work, and Think. I had the distinct pleasure of interviewing them earlier today for the Tech Nation broadcast, to be aired in the coming weeks, and I thought you should know a few other things. Professor Mayer-Schönberger has more than one degree, only one of which is from Harvard. He is a lawyer and has earned a master's in economics from the London School of Economics, with over 100 academic papers and seven books to his credit. I think my favorite title is Delete: The Virtue of Forgetting in the Digital Age. His co-author is best known for his career at The Economist. Prior to being the data editor, he was the Japan business and finance editor. You might also know him as a technology editor
at the Asian Wall Street Journal in Hong Kong. All very important, because big data is not just here in the United States; it is global. So please welcome our guests. [applause] >> Thank you very much. It is a pleasure to be here. Big data will change how we live, work and think, and our journey begins with a story that begins with the flu. Every year the winter flu kills tens of thousands of people around the world. But in 2009 a new virus was discovered, and experts feared it could kill tens of millions. There was no vaccine available. The best that health authorities could do was slow the spread, but they
needed to know where it was. In the U.S., the CDC had doctors report new flu cases, but collecting and analyzing the data takes time, so the CDC's picture of the crisis was a week or two behind, which is an eternity with a pandemic under way. Around the same time, engineers at Google developed an alternative way to predict the spread of the flu, not just nationally but down to regions in the United States. They used Google searches. Google handles more than 3 billion searches per day and saves them all. They took 50 million of the most common search terms and compared when and where they were searched for with flu data going back five years. The idea was to predict the spread of the flu through
web searches alone. They struck gold. What we are looking at right now is a graph showing that, after crunching through half a billion mathematical models, Google identified 45 search terms that predicted the spread of the flu with a high degree of accuracy. Here you can see the official data of the CDC and Google's predicted data from search queries. But where the CDC had a two-week reporting lag, Google could spot it almost in real time. Strikingly, it did not involve contacting physicians' offices but was built on big data, the ability to harness data for novel insights, goods and
services. Look at another example, another company. In 2003 a computer science professor was taking an airplane, and he did what we all thought we knew to do: he bought his ticket well in advance of the day of departure. But at 30,000 feet he could not help but ask the passenger next to him what she paid, and sure enough she paid less. He asked another passenger, and he also paid less, even though they both bought their tickets much later than he had. He was upset. But being a computer science professor, he did not only get upset; he thought of a research project. He did not need to know all the reasons for how to save money, whether you should buy in advance
or whether there is a Saturday-night stay. He realized the answer was hidden in plain sight: all you needed to know was the price that every other passenger paid on every other airline for every single seat for an entire year. This is a big data problem, but it is possible. He scraped a little bit of data and found he could predict with a high degree of accuracy whether a price presented online is a good price and you should buy the ticket right away, or whether you should wait and buy it later when the price will go down. He called it Hamlet: to buy or not to buy, that is the question. [laughter] But a little data got him a good prediction.
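[Editorial illustration] The buy-or-wait prediction described here can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the professor's actual method, which the talk does not detail. It assumes a hypothetical `history` of past flights, each mapping days before departure to the fare observed that day, and it advises waiting when the fare later dropped in more than half of the comparable past flights. The function names, the 0.5 threshold and the sample fares are all invented for illustration.

```python
from typing import Dict, List

# Hypothetical data shape: one dict per past flight,
# mapping days-before-departure -> fare observed on that day.
FareHistory = List[Dict[int, float]]


def drop_rate(history: FareHistory, days_left: int) -> float:
    """Fraction of past flights whose fare later fell below the fare seen at `days_left`."""
    comparable = 0
    dropped = 0
    for fares in history:
        if days_left not in fares:
            continue  # no quote was recorded at this point for that flight
        later = [price for day, price in fares.items() if day < days_left]
        if not later:
            continue  # nothing was observed closer to departure
        comparable += 1
        if min(later) < fares[days_left]:
            dropped += 1
    return dropped / comparable if comparable else 0.0


def advise(history: FareHistory, days_left: int, threshold: float = 0.5) -> str:
    """Return "wait" if fares usually dropped later, otherwise "buy"."""
    return "wait" if drop_rate(history, days_left) > threshold else "buy"


# Invented sample fares for three past flights on one route.
sample = [
    {14: 300.0, 7: 250.0, 1: 400.0},
    {14: 300.0, 7: 320.0, 1: 350.0},
    {14: 280.0, 7: 260.0, 1: 270.0},
]
print(advise(sample, days_left=14))
print(advise(sample, days_left=7))
```

With more price records, the rate is estimated from more comparable flights, which is one plausible reading of why a larger data set improved the predictions.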
Two years later he was crunching 75 billion flight price records, almost every single flight in aviation for one year, and now the predictions were very good. Microsoft knocked on his door and he sold the company for $100 million. But the point is that the data was generated for one purpose and reused for another. Information has become a raw material of business; it is a new economic input. It is tempting to think of big data in terms of size. It is true that our world is awash in data. The digital data collected is growing fast, doubling almost every three years. The trend is obvious when we look at the sciences. The Sloan telescope that
began in 2000 gathered more data in its first few weeks than had been amassed in the entire history of astronomy. Over 10 years the telescope collected astronomy data exceeding 140 terabytes of information, but the successor to that telescope, coming online in 2016, will acquire that amount of data every five days. And companies are drowning in data, too. Twitter messages number more than 400 million per day. YouTube has more than 800 million monthly users who upload one hour of video every single second. On Facebook, over 10 million photos are uploaded every hour. Google processes 100 times the quantity of all
printed material in the U.S. Library of Congress. The quantity of data in the world is estimated to reach 1.2 zettabytes, of which only a small percentage is non-digital. So here in Silicon Valley it is often described by [inaudible], but it is about more than just the volume. We suggest three features to characterize big data. First, today we can collect and analyze far more data on a particular problem than
ever before, when we were limited to working with just a small sample. But what matters is the size of the data relative to the phenomenon we study. That gives us a remarkably clear view of details that conventional sampling cannot provide. We can also let the data speak, and that reveals insights we never would have thought of. The second feature of big data is the embrace of messiness. Looking at far more data permits us to loosen some of our desire for precision. When our ability to measure was limited, we had to treat what we did bother to quantify as precisely as possible. Rather than going out to measure and collect small
quantities of data, with big data we will often be satisfied with a general direction rather than striving to know a phenomenon down to the atom or the penny. We don't give up on accuracy entirely; we only give up our singular devotion to it. What we lose in accuracy at the micro level, we gain in insight at the macro level. The third feature is the change away from the age-old search for causality. Instead of asking why, looking for elusive causal relationships, in many instances we can simply ask what, and that is good enough. And that is hard for us
humans to comprehend, because we are conditioned, maybe even hard-wired, to understand the world as a series of causes and effects. It is comforting, reassuring, and oftentimes it is plain wrong. If we fall sick after we eat at a new restaurant, the hunch is that it was the food, although it is far more likely we got the stomach bug by shaking hands with a colleague. Quick causal hunches often send us down the wrong path. With big data we now have an alternative available: instead of looking for causes, we can go for correlations, to uncover connections and associations between variables that we might not have known otherwise. Companies like Netflix are making
predictions and recommendations to customers, and correlations are at the heart of the translation service. They do not tell us why, and they do not know why, but they tell us what, at a crucial moment and in time for us to act. These features of big data are used to save lives. Premature babies are prone to infection. It is important to know about infections very early on. How would you do that? In the analog, small-data world you would take vital signs every couple of hours: oxygenation levels, heartbeat, these types of things. As part of a research project in Canada, they collect 16
real-time data flows, over 1,000 data points each second. Then they combine the data to look for patterns or correlations, and they are able to spot the onset of an infection 24 hours in advance, well before the first symptoms would manifest. That is incredibly important for these preemies, because then they can receive medication well before the infection is strong and perhaps cannot be battled successfully. Perhaps intriguingly, the best predictor is not that the vital signs go haywire but that they stabilize. We don't know why, but we do
know that in the small-data age a doctor would look at the stabilization and say the baby is doing well and I can go home for the night. Now we know the baby may be in trouble. It fits the fundamental features of big data: [inaudible]. The data was so vast [inaudible]. And the findings answer what is happening but not why, not the biological mechanism at work. . . .
Then GPS gave us a mechanism to do this. And now it's the smartphone that we're probably carrying in our pockets. Now our location has been datafied. Our mobility is datafied all the time. Think of books. Think of words. In the past we would look up at the temple of Delphi and see the mottos etched in stone.
Then the book was digitized and we had digital words. We get some of the benefits of digitization. We can store it and we can share it, but we can't process it per se. We can't analyze it. It was an image file. The words had not been datafied. What happens if we can take the word, extract it and treat it as data? What they are doing is looking back at journal articles going back a century. These are hundreds of thousands of articles, and they are looking for side effects. A human being reading the journals for a century would not be able to spot some of the weird correlations of drug side
effects. But a machine can. Big data can. You get that from the words. All of you in the audience right now are sitting. Think of it in terms of something as fundamental as posture. The way you are sitting, and you are sitting, and you and you and you: it's all different. In fact, the way you're sitting is a function of your weight, the distribution of your weight and your leg length. If we were to measure it with instruments, with 100 sensors, the way you sit would be personal. It would look a little bit like a fingerprint. One person sits differently than another. What could we do with this? Researchers in Tokyo are placing sensors in car seats. It's an antitheft device. Suddenly the car would know when someone else is driving it, and maybe you would put in a control so that, if that was happening, you would cut out the engine. If you have a teenager, it's
useful to say you're not allowed to drive after 10:00 p.m.; the engine doesn't start. Okay, that's great. But imagine: what if 100 million vehicles had this on the roadway today? Let's think what we could do. Perhaps we would be able to identify the telltale signal of the shift in body posture prior to an accident, thirty seconds prior to an accident. We would have datafied driver fatigue, and the car might know, and the service would be to alert the driver. Maybe the steering wheel would vibrate. There would be a chime inside the car to let you know that your body posture has changed. Those are the things we can do when we use datafication. It's also the core by-product of social media platforms.
Facebook has datafied our friendships and the things we like. Twitter, our thoughts and whispers. LinkedIn, our professional contacts. Once things are in data form, they can be transformed into something else. A myriad of reuses are possible. [inaudible] The value shifts from the reason the data was collected and the immediate uses on the surface to the subsequent uses that may not have been apparent initially but are worth a lot.
Think of delivery vehicles. UPS has 60,000 vans on the road. They need maintenance. It's a problem that can be fixed with information. When a car breaks down, it doesn't break down all at once. It lets you know. For example, you might be driving and it feels funny. There's a strange sound that it normally doesn't make. If you place sensors in the engine, what we would be able to do is datafy some of this. We would be able to measure the vibration or measure the heat, and we can compare that signature with what a normal engine sounds like and what the likely problem is. And suddenly what we can do, and what UPS does to save money, is predict the breakdown. It's called predictive maintenance. When the sensor reading tells them that the heat is going up or the vibration is out of the bounds of normalcy, they know they need to bring the van to a service station, get a tune-up and probably replace a part. They are able to replace a part
before it breaks. Another company uses data from 100 million cars to predict traffic flow in cities around the world. By reusing the old data, it found a strong correlation between road traffic and the health of the local economy. The business model is to predict how long it takes to go from one place to another. It's a traffic prediction service. They are reusing the data and turning it into a new form of economic value. There's a correlation between the road traffic in a city and its economic health. But there's more. One investment fund uses the data on the weekend traffic around a large national retailer, because it correlates very strongly with the retailer's sales. You can see where this is headed. It can measure the road traffic
in the proximity of the stores and trade that company's shares prior to the quarterly earnings announcement. It has a lens into whether the sales will increase or decrease. That's data's hidden value. >> Given the hidden value of data, big data offers extraordinary benefits. Unfortunately, it also has a dark side. As we just heard, so much of data's value remains hidden, ready to be unearthed by secondary uses. That puts big data on a collision course with how we protect individual privacy: by telling individuals at the point of collection, through notice and consent, why we are gathering the data and asking
for consent. But in the big data age we simply do not know, when we collect data, for what purposes we will be using it in the future. So as we reap the benefits of big data, our core mechanism of privacy protection is rendered ineffective. There is another dark side, a new problem that emerges: algorithms predicting human behavior, what we are likely to do, how we will behave rather than how we have behaved, and penalizing us before we have even committed the infraction. If you think of Minority Report, that's exactly right. In a way, that provides value, right? Isn't prevention through probability better than punishment after the fact?
And yet such a big data use would be terribly misguided. For a start, predictions are never perfect; we would punish people without certainty, negating a fundamental tenet of justice. By intervening before an action has taken place and punishing the individuals involved in it, we essentially deny them human -- [inaudible] In a world of predictive punishment, we never know whether or not somebody would actually have committed the crime. We would not let events play out, holding people responsible on the basis of a big data analysis that can never be disproven. But let's be careful here. The culprit is not big data itself.
The culprit is how we use it. The crux is that holding people responsible for actions they have yet to commit is using big data correlations to make decisions about individual responsibility, about the why. As we have explained, big data correlations cannot tell us about the why, the causality behind things. Often that's good enough. But it makes big data correlations singularly unfit to decide whom to punish and hold responsible. The trouble is that we humans are primed to see the world through the lens of cause and effect. Thus big data is under constant threat of being abused for causal purposes, and it threatens to imprison us,
perhaps literally, in probabilities. So what can we do? To begin with, there's no denying big data's dark side. We can only safely reap the benefits of big data if we expose its evils and discuss them openly. And we need to think innovatively about how to contain these evils and how to prevent the dark side from taking control. One suggestion is that information privacy in the big data age needs a modified foundation. In this new era, privacy control by individuals will have to be augmented by direct accountability of the data users. Second, and perhaps more importantly, on the dangers of punishing people based on predictions rather than actual behavior, we suggest we have to expand our understanding of justice. It is different in the big data age
rather than the small-data age. The big data age will require us to enact safeguards for human free will, as much as we currently protect procedural fairness. Government must never hold an individual responsible for what they are only predicted to do. Third, most big data analysis, today and going into the future, is too complex for the individuals affected to comprehend. If we want to protect privacy and protect individuality in the big data age, we need help. Professional help. Much like privacy officers aid in ensuring privacy measures are in place, we envision a new caste of experts, call them algorithmists, who understand the complexity of big data. They are experts in big data analysis and act as reviewers of big
data [inaudible] and how to control them. There's another challenge, one that is not unique to big data but that in the big data age society needs to be extra vigilant to guard against. That's what we call the dictatorship of data. It's the idea that we may fetishize the data and endow it with more meaning and importance than it deserves. As big data starts to play a part in all areas of life, the tendency to place trust in the data and cut off our common sense may only grow. Placing one's trust in data without a deep appreciation of what the data means and an understanding of its limitations can lead to terrible consequences. In American history, we have experienced a war fought on behalf of a data point: the war in Vietnam. And the data point was the body
count. It was used to measure progress when the situation was far, far more complex. So in the big data age, it will be critical that we do not follow blindly the path that big data seems to set. Big data will help us. It's going to help us understand the world better and improve decisions. But we must remain the master. We need to carve out a space for the human: for our reason, our imagination, for acting in defiance of what the
data says, because the data is always just a shadow of reality, and therefore it is always imperfect, always incomplete. As we walk into the big data age, we need to do so with humility and humanity. Thank you very much. [applause] >> Wonderful, thank you. The coauthors of Big Data. [applause] And now it's time for the audience question-and-answer period.
I have a number of questions. If I could ask to have those over on the side, we'll be able to do that. I want to get to everyone's questions. They are all business. [laughter] On top of that, it suggests you may not think at work, which is my kind of work. [laughter] What is the worst? The negatives, the ills, the list goes on and --
[inaudible] >> We talked about the dark side. I mentioned the danger of propensity. Ken mentioned the dictatorship of data and the privacy challenge. The privacy challenge is one that is severe, because the mechanisms that we have to protect privacy are becoming -- But in writing the book we really thought more that the propensity challenge is one that often gets overlooked and, going forward, is going to become incredibly important. It challenges the role of free will and human volition.
In the book we suggest a number of possibilities to do that. But that's really what keeps me awake at night. >> Well, in a real sense, we're all automatically collecting and disseminating all this data. If you have ever been to a hospital, you realize you give all the data, not just what you sign on a form. Do we have an expectation of privacy in the big data age? >> Well, in some instances we have to ask the question: should we have an expectation of privacy? So let's take health care as an example. We have a very cumbersome legal regime that actively blocks the sharing of health care data. You can imagine that in 100 years our children are going to look back on us and be bewildered at how we could let this priceless
information for improving care slip away, not just here in America but around the world. What we probably need to do is have a healthy debate and change the narrative entirely, and say that perhaps we should make it a condition of citizenship that the data gets shared. It's true there's a problem. There's a risk of inadvertent disclosure leading to bad consequences. Let's look at mechanisms to control and police it. Learning from the data is a social goal. >> It's interesting, because people say maybe we'll de-identify or encrypt that. We know who you are. [laughter] You have DNA data on somebody; we know who you are. So in a real sense.
Depending on the regulatory regime in which you live, the data very rarely gets used for research into what could actually help you if you have the condition, or how the condition could be prevented. What we need to do is unleash the power of big data on the research side rather than on the cost-effectiveness side. [inaudible] >> In science we say we have 220 people in the study, or 60 people in the study, or even 500 people in the study, and there are very few long-term clinical studies with 15 or 120,000 people. With big data we can change the face of science. >> Absolutely. It's almost laughable, because we were in Silicon Valley today at Facebook.
And we are discussing the pluses and pitfalls of big data. You can find video of Commonwealth Club programs online at fora.tv. And of course, everybody wants to know, Ken: you're a data editor, you're not a data input clerk? >> I'm not a data input clerk. >> Now we have a promotion. What is a data editor? >> It's a new title. I'm the first one.
We have been in the game for 160 years; there's nothing new there. But we recognize there are new techniques: we can use data as the basis of stories instead of, if you will, anecdote-based journalism, where you talk to a source and do the pattern recognition for the story through talking to many people. Just as our sources might lie to us as journalists, data might lie as well. We have to keep our suspicions up. We crunch a lot of numbers to visualize and tell a story. And the data editor is a service provider to the rest of the organization. >> It's a question I forgot to ask today. We hear in the papers every day about Google telling us where it
was. Apparently Google Flu Trends overestimated the flu outbreak. What happened? >> Well, first of all, it's a prediction. If the prediction tells you that 85 percent of the time you're right, that means 15 percent of the time you are wrong. And so being wrong is just part of being in the prediction game. Then, of course, this is a dynamic world in which you need to rerun your model all the time, because if CNN reports on Flu Trends or reports on the flu, people might search for the flu even though they don't have it. There's a feedback mechanism in place. And Google Flu Trends is being compared to Centers for Disease Control data. Maybe the flaw is in the Centers for Disease Control data rather than the Google data. We don't know. And so what we should not do is to immediately create a -- [inaudible] and say it must be because
Google's model is wrong. That's dangerous. We shouldn't do that. When we look at the spike, we should investigate with an open mind. >> I would like to underline that. When Google first did a fitting of the model, we were not in a recession. Now people are not going to the doctor because they feel they can't take a day off work or can't afford it. Flu Trends might be more accurate in terms of the outbreak of the flu, and the CDC data may have more
availability. >> I'll introduce you to the CDC. They'll be delighted to hear it. [laughter] Once you know something is being collected, you know that. It's why we have double-blind studies. I find out if I have a flu symptom. We don't know, once it becomes public, how people change. The data that is collected is there. That's a problem as well. A massive point you have is that if the CDC is only recording the people that go to doctors, that's changed dramatically, even with the internet. So we have to really be good at this new big data role of analytics. For some people, the algorithmist,
when you have that someone coming with the algorithm, how we look at it and account for what we are tossing around here, it's a dynamic kind of thing. You speak about this new job category, if you will. What are the qualifications? What goes into this? What does one need? >> That's a great question. The coming generation will need to know how to collect data, scrape it off the internet and put it in storage, perhaps not in the old-fashioned structured way but in the more unstructured storage we see today. Then they need to look at the data and analyze it. They need to use statistical packages. They need to use network analysis. There's a variety of tools and
methods available. They might need a good grounding in the latest statistics. A lot of the statistical methods we use were designed for the small-data age. We might need to upgrade or improve them to an extent. And then they might also need some sense of visualization as we go to big data. In addition to all of that, we would like to see them with a grounding not just in mathematics but in philosophy, a more general theory. Oftentimes the people who are doing very well are those that come from the natural sciences, particularly physicists, who are well trained to deal with huge amounts of data, either through astronomy,
through telescopes and the data gathering there, or take -- That is the kind of interdisciplinary mix we need. And unfortunately few universities around the world have programs yet to educate them. I hope that is going to change. >> Now we have traditional statistics, which [inaudible]. We remember this. Do we have new statistical techniques for big data? Are they emerging in IT? >> Yes, they are. In many ways we're looking at [inaudible] advancement. Right now the classical statistical approach is to look for linear relationships.
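[Editorial illustration] To make the remark about linear relationships concrete, here is a minimal sketch, not taken from the talk. It fits an ordinary least-squares line and reports R², the usual goodness-of-fit measure for such a line, first on data that really is linear and then on data that depends perfectly on x but along a curve. The function name and the sample values are assumptions made for illustration, and the function assumes the inputs are not constant.

```python
def linear_r_squared(xs, ys):
    """R^2 of the ordinary least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    # Residual variation left over after the line, versus total variation in y.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Invented sample data: one linear relationship, one curved (quadratic) one.
xs = [-3, -2, -1, 0, 1, 2, 3]
straight = [2 * x + 1 for x in xs]
curved = [x * x for x in xs]

print(linear_r_squared(xs, straight))  # the line explains the linear data
print(linear_r_squared(xs, curved))    # the line explains none of the curve
```

The second series is completely determined by x, yet a straight line explains none of its variation, which is the kind of relationship a purely linear tool can miss.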
If a increases, b will increase or decrease in the same way. But a lot of times that's not the case. It's much more complex; the relationship might be quite different than that. We need some advances. We need insight. We need better ways to measure the fit of a model to data. Today statisticians run around and talk about R-squared and how well a particular model fits. Can we do that in the big data world? We need to upgrade a lot of these tools, these methods we have available. It doesn't mean they are bad. There's room for improvement. >> Is it possible -- what can we regulate with respect to big data and what can't we? >> You throw the curve ball
question. I think what we need to do is make sure that we are not stifling innovation. We need to focus on the risks. We talked about the privacy challenge. We talked about the propensity challenge and the dictatorship-of-data challenge. We need to find pragmatic solutions and safeguards to ensure that society and individuals are going to be protected. >> What can you not regulate? >> It's a not very good answer.
Right now, if we go to a doctor and are told we have to have an operation, we can ask the doctor why, and the doctor can tell us: I learned it in medical school, and these are the features, this is why you need the operation. He can point to something. With the benefit of the instrumentation, we would ask the doctor why the operation, and the risk is that he might say, I don't know. You can also say this more generally. You may ask the bank that denies you a loan: why was I denied a loan? They say it's because of the credit rating. But what if we look at a thousand variables, and what if in all of those there were [inaudible] strong
signals and the long tail of -- [inaudible] all of it in a complicated formula tailored to the individual that was also changing over time? [inaudible] the idea of social responsibility, society and government. I don't think you can break the two apart. >> Roughly a three quarter [inaudible] people. Ten million people will go that far. What should the city of San Francisco do, priority one, two, three, about big data? >> Simple. The first thing you need -- >> I love that you say simple.
>> First, San Francisco is in a leadership position in the United States. We should applaud that. >> In what way? I didn't know that. >> Yeah. Well, there was a gentleman, I want to say the name Chris Vein comes to mind, the CIO of San Francisco, who I believe now works at the White House. He opened up the public transport data so developers can build apps alongside it. He would build those and bring the developers together. And San Francisco is actually doing very good things. But the gemstone in the United States is New York City. There they have a director of analytics. San Francisco might want to look at the model. What he's done, the fellow created a small little team to
act as a service provider to all the other administrative agencies in the city. For example, the buildings agency says: we don't know which buildings, at the outset, are most at risk of fire, versus ones that are not a problem. We get 60,000 complaints a year to our help line. We only have 200 inspectors. How can data help us? He built a model. He brought in all of the data from other agencies and looked at things like ambulance visits, utility cuts or exterior brick
work done. In the past, when the inspectors went in, they issued a vacate order to clear the building [inaudible]. Now they do it in 70% of the visits. Everybody loves it. It's doing more with less in the age of austerity. If we can get the right data to the right people, we can make a big difference. >> I get chagrined, and I'm sure other people do. Statistically, people say we only have to talk to twelve people and that totally describes 12 million. And this is the kind of thing -- I don't believe you can with twelve people. What is the argument there? Is it going to get worse?
>> Well, if I may take your question in a slightly larger context, in a way it gets to the heart of what big data is. In the small-data age, the way we approached problem solving and decision making was that, because we were starved for data, we would come up with a theory of how something works, then collect data to prove or disprove the hypothesis, and then we would go back, change the hypothesis a little bit and try all over again to collect the data.
It worked reasonably well in the small-data age. It's an artifact of the small-data age. Now we can use big data to come up with hypotheses and test them. Take the example of Google. When they tried to find out which of 50 million search terms were the best to predict the spread of the flu, they had no clue which of the 50 million to pick. They would have had to pick one first and, every time, sample again.
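[Editorial illustration] Letting the data pick the terms, rather than guessing them in advance, can be sketched as a screening loop: score every candidate search term against the official flu counts and rank the results. This is a minimal sketch under stated assumptions. Pearson correlation is an assumed scoring rule, since the talk does not say what Google's models computed, and the term names, weekly counts and function names are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_terms(term_counts, official_cases):
    """Score every candidate term and sort by strength of correlation."""
    scored = [
        (term, pearson(counts, official_cases))
        for term, counts in term_counts.items()
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Invented weekly figures: official case counts and searches per term.
official_cases = [10, 20, 40, 80, 40, 20]
term_counts = {
    "elephant": [5, 3, 4, 4, 3, 5],
    "flu symptoms": [100, 210, 390, 820, 410, 190],
    "high school basketball": [50, 60, 70, 80, 70, 60],
}

for term, score in rank_terms(term_counts, official_cases):
    print(f"{term}: {score:.3f}")
```

In this toy data a seasonal but medically unrelated term also scores well, so a high correlation on its own does not show that a term belongs in the model.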
It's crazy. And what is the exact combination? It's crazy to do that. What you want to do is have a method by which you can create a way of producing hypotheses. And that's the way we're using big data analysis: not just to tell us whether we are right or wrong, but to help us come up with the hypotheses and check them at the same time. >> They have a new strategy coming up. >> Yeah. What they are doing is taking the 50 million most common search terms and essentially trying each one to see if it improves the model. When they find one that is good for the model, they try the next one out. >> Elephant or sniffles?
>> Nope. In the top 100 terms was the term high school basketball. It's played in the wintertime. There's a correlation. Keep in mind it was deep in the 60s or 70s. The model tried the 44th term. It was good. [inaudible] And whatever you -- the new data, no problem. And keep in mind there's -- [inaudible] If you get new data, you rerun
the whole thing. It gets better over time. You don't think that at any one point in time you have the data and there's the answer. I think that's part of it. >> We are moving ahead now. I didn't come up with this question: given that the field is growing and changing so fast, estimate the shelf life of your book. [laughter] >> It's timeless.
It's the first book out of the gate to define the trend. It's not just going to affect business. It's not just going to affect government or health care. It's going to affect everything. It's like computing in the 1950s: if you had asked where computing is going to go next, what industries it will be useful for, the person would have to honestly answer that it's not the right question, because by the year 2013 computers will have wormed their way into everything until they are almost invisible. Everything is going to learn from data, datafying things and learning from big data. The way we have self-driving cars is not because we can program a computer [inaudible]; we pour in a lot of data and let the statistics and the machine teach itself.
>> [inaudible] It's one example: large data like the DNA of the people in the room, all the DNA generated by the complete genome and the new national cancer [inaudible]. We're talking about large, complex data. You can't look at that in an Excel spreadsheet, probably. >> If big data is about correlation and not cause, how can you judge -- >> I think it's important to --